These are variables - do you know what they mean?
TGW - yes, it's a thing
ODO - what do you think it is?
NO3 - what is it? Are you sure? Why might you get in legal trouble if you used this?
We’ll use a dataset on grayling fish from two different lakes to explore these concepts.
Biology is fundamentally different from fields like physics in that:
Statistics helps us understand biological processes in this variable world by:
Practice Exercise 1: Can you do this for the pine data we have collected?
Before we dive into descriptive statistics, let’s clarify some fundamental concepts:
Types of populations:
Sampling involves
It’s important to distinguish between population parameters (true values for the whole population, such as μ and σ) and sample statistics (values calculated from a sample, such as \(\bar{Y}\) and \(s\)).
For example: the mean length of every fish in Lake I3 is a parameter; the mean length of the fish we actually caught is a statistic.
The standard deviation formula above includes n-1 in the denominator (rather than n) to provide an unbiased estimate of the population parameter.
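The n-1 claim can be checked by simulation. The sketch below (all numbers are made up for illustration) draws many small samples from a population with a known variance and compares the two denominators:

```r
# Sketch: checking the n-1 claim by simulation (all numbers made up)
set.seed(42)
true_var <- 25          # population variance (sd = 5)
n <- 10                 # small sample size
n_sims <- 10000

var_nm1 <- numeric(n_sims)   # denominator n-1 (what R's var() uses)
var_n   <- numeric(n_sims)   # denominator n

for (i in 1:n_sims) {
  x <- rnorm(n, mean = 100, sd = sqrt(true_var))
  var_nm1[i] <- var(x)
  var_n[i]   <- var(x) * (n - 1) / n
}

mean(var_nm1)   # close to the true value, 25
mean(var_n)     # systematically too small, near 25 * 9/10 = 22.5
```

Averaging over many samples, the n-1 version centers on the true variance while the n version underestimates it, which is exactly what "unbiased" means here.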
Understanding the type of variable you’re working with is essential for selecting appropriate statistics:
Derived Variables
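A derived variable is computed from one or more measured variables. One classic fisheries example is Fulton's condition factor, which combines length and mass into a single index; the sketch below uses toy numbers, with column names mirroring our grayling dataset:

```r
# Derived variable example: Fulton's condition factor
# K = 100 * mass(g) / length(cm)^3  -- a standard fisheries index
# (toy numbers below; column names mirror the grayling dataset)
fish <- data.frame(
  total_length_mm = c(266, 290, 262),
  mass_g          = c(135, 185, 145)
)

fish$condition_K <- 100 * fish$mass_g / (fish$total_length_mm / 10)^3

fish$condition_K   # one dimensionless "plumpness" value per fish
```

Because K is derived from two ratio-scale measurements, it is itself a continuous ratio-scale variable, and any measurement error in length or mass propagates into it.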
Let’s explore our grayling fish dataset and identify the types of variables it contains.
Practice Exercise 2: Can you do this for the pine data we have collected?
Let’s examine the different variables and determine what type each one is.
# Write your code here to read in the file
# How could you examine the data? What ways can you think of? Let's try them!
# Load the grayling data
grayling_df <- read_csv("data/gray_I3_I8.csv")
Rows: 168 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): lake, species
dbl (3): site, total_length_mm, mass_g
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
site lake species total_length_mm mass_g
<dbl> <chr> <chr> <dbl> <dbl>
1 113 I3 arctic grayling 266 135
2 113 I3 arctic grayling 290 185
3 113 I3 arctic grayling 262 145
4 113 I3 arctic grayling 275 160
5 113 I3 arctic grayling 240 105
6 113 I3 arctic grayling 265 145
Rows: 168
Columns: 5
$ site <dbl> 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113,…
$ lake <chr> "I3", "I3", "I3", "I3", "I3", "I3", "I3", "I3", "I3", …
$ species <chr> "arctic grayling", "arctic grayling", "arctic grayling…
$ total_length_mm <dbl> 266, 290, 262, 275, 240, 265, 265, 253, 246, 203, 289,…
$ mass_g <dbl> 135, 185, 145, 160, 105, 145, 150, 130, 130, 71, 179, …
When taking biological measurements, understanding measurement quality is essential:
Accuracy is a function of both precision and bias. For statisticians, bias is usually a more serious problem than low precision because:
It’s harder to detect (true value usually unknown)
Low precision can be compensated for by increased sample size
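A small simulation (with a made-up "true" length) illustrates why bias is the more serious problem: averaging more measurements shrinks random error but leaves systematic error untouched.

```r
# Precise-but-biased vs unbiased-but-imprecise (made-up true value)
set.seed(1)
true_length <- 265   # mm, known only because we are simulating

biased_precise <- rnorm(50, mean = true_length + 10, sd = 1)   # e.g. a mis-zeroed measuring board
unbiased_noisy <- rnorm(50, mean = true_length,      sd = 10)  # e.g. a wriggling fish

mean(biased_precise)   # stays near 275: averaging cannot remove bias
mean(unbiased_noisy)   # near 265: measuring more fish would shrink this error further
```

The biased measurements look reassuringly consistent, which is precisely why bias is hard to detect without an independent check against a known standard.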
Practice Exercise 1: What are potential sources of error when measuring pine needles or fish?
For our grayling data, potential sources of measurement error might include:
Precision issues:
Bias issues:
Accuracy issues: what could they be?
The two most common measures of central tendency are the mean and the median.
The Arithmetic Mean

The arithmetic mean is the average of a set of measurements:

\[\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}\]
Where:
\(Y_i\) represents each individual measurement
\(n\) is the total number of observations
# Calculate mean length of all fish
mean_length <- mean(grayling_df$total_length_mm)
cat("Mean length of all fish:", round(mean_length, 1), "mm\n")
Mean length of all fish: 324.5 mm
# Calculate mean by lake
grayling_df %>%
group_by(lake) %>%
summarise(mean_length = mean(total_length_mm, na.rm=TRUE)) %>%
  kable(caption = "Mean length by lake", digits = 1)

| lake | mean_length |
|---|---|
| I3 | 265.6 |
| I8 | 362.6 |
The Median
# Calculate median length of all fish
median_length <- median(grayling_df$total_length_mm)
cat("Median length of all fish:", median_length, "mm\n")
Median length of all fish: 324.5 mm
# Calculate median by lake
grayling_df %>%
group_by(lake) %>%
summarise(median_length = median(total_length_mm)) %>%
  kable(caption = "Median length by lake", digits = 1)

| lake | median_length |
|---|---|
| I3 | 266 |
| I8 | 373 |
The spread of a distribution tells us how variable the measurements are.
The variance is:

\[s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}\]

The standard deviation is the square root of the variance:

\[s = \sqrt{s^2}\]
# Calculate standard deviation of fish length
var_length <- var(grayling_df$total_length_mm)
sd_length <- sd(grayling_df$total_length_mm)
cat("Variance of length:", round(var_length, 1), "mm²\n")
Variance of length: 4225.9 mm²
cat("Standard deviation of length:", round(sd_length, 1), "mm\n")
Standard deviation of length: 65 mm
# Calculate by lake
grayling_df %>%
group_by(lake) %>%
summarise(
var_length = var(total_length_mm),
sd_length = sd(total_length_mm)
) %>%
  kable(caption = "Standard deviation and variance by lake", digits = 1)

| lake | var_length | sd_length |
|---|---|---|
| I3 | 801.1 | 28.3 |
| I8 | 2739.4 | 52.3 |
For a bell-shaped (normal) distribution, the area under the curve within ±2 standard deviations of the mean includes about 95% of the data.
# Read the data
fish_data <- read.csv("data/gray_I3_I8.csv")
# Filter data for Lake I3
i3_data <- fish_data %>%
filter(lake == "I3")
# Calculate statistics
mean_length <- mean(i3_data$total_length_mm)
sd_length <- sd(i3_data$total_length_mm)
# Calculate the bounds for standard deviations
minus_2sd <- mean_length - (2 * sd_length)
plus_2sd <- mean_length + (2 * sd_length)
# Calculate percentage of data within 2 SD
percent_within_2sd <- 100 * mean(
i3_data$total_length_mm >= minus_2sd &
i3_data$total_length_mm <= plus_2sd
)
# Create the plot
ggplot(i3_data, aes(x = total_length_mm)) +
# Add density curve
geom_density(fill = "skyblue", alpha = 0.6) +
# Add vertical line for mean
geom_vline(xintercept = mean_length, color = "navy", linewidth = 1) +
# Add vertical lines for +/- 2 SD
geom_vline(xintercept = minus_2sd, color = "darkred", linewidth = 0.8, linetype = "dashed") +
geom_vline(xintercept = plus_2sd, color = "darkred", linewidth = 0.8, linetype = "dashed") +
# Highlight area within 2 SD
annotate("rect",
xmin = minus_2sd, xmax = plus_2sd,
ymin = 0, ymax = Inf,
fill = "lightgreen", alpha = 0.3) +
# Add annotations
annotate("text",
x = mean_length, y = 0.010,
label = paste0("Mean = ", round(mean_length, 1), " mm"),
color = "navy", fontface = "bold", size = 4) +
annotate("text",
x = mean_length, y = 0.009,
label = paste0("SD = ", round(sd_length, 1), " mm"),
color = "darkred", size = 3.5) +
annotate("text",
x = mean_length, y = 0.008,
label = paste0(round(percent_within_2sd, 1), "% within ±2 SD"),
color = "darkgreen", size = 3.5) +
# Add labels for SD boundaries
annotate("text",
x = minus_2sd, y = 0.002,
label = paste0("-2 SD (", round(minus_2sd, 1), ")"),
color = "darkred", angle = 90, hjust = 0, size = 3) +
annotate("text",
x = plus_2sd, y = 0.002,
label = paste0("+2 SD (", round(plus_2sd, 1), ")"),
color = "darkred", angle = 90, hjust = 0, size = 3) +
# Add title and labels
labs(
    title = "Distribution of Fish Lengths in Lake I3",
subtitle = "Area between dashed lines represents ±2 standard deviations from the mean",
x = "Total Length (mm)",
y = "Density",
caption = paste0("n = ", nrow(i3_data), " fish")
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11),
axis.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
) +
# Set x-axis limits to show the full range plus a bit of padding
  xlim(min(i3_data$total_length_mm) - 10, max(i3_data$total_length_mm) + 10)

I3 Lake Fish Length Summary:
Number of fish: 66
Mean length: 265.61 mm
Standard Deviation: 28.3 mm
Range for ±2 SD: 209 to 322.21 mm
Percentage within ±2 SD: 90.91 %
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:

\[CV = \frac{s}{\bar{Y}} \times 100\%\]
This is useful for comparing the variability of measurements with different units or vastly different scales.
# Calculate coefficient of variation
cv_length <- sd_length / mean_length * 100
cat("Coefficient of variation:", round(cv_length, 1), "%\n")
Coefficient of variation: 10.7 %
# Calculate by lake
grayling_df %>%
group_by(lake) %>%
summarise(
cv_length = sd(total_length_mm) / mean(total_length_mm) * 100
) %>%
  kable(caption = "Coefficient of variation by lake", digits = 1)

| lake | cv_length |
|---|---|
| I3 | 10.7 |
| I8 | 14.4 |
The interquartile range (IQR) is the range of the middle 50% of the data:
\[IQR = Q_3 - Q_1\]
Where \(Q_1\) is the first quartile (25th percentile) and \(Q_3\) is the third quartile (75th percentile).
# Calculate quartiles and IQR
q1_length <- quantile(grayling_df$total_length_mm, 0.25)
q3_length <- quantile(grayling_df$total_length_mm, 0.75)
iqr_length <- IQR(grayling_df$total_length_mm)
cat("First quartile:", q1_length, "mm\n")
First quartile: 270.75 mm
cat("Third quartile:", q3_length, "mm\n")
Third quartile: 377 mm
cat("Interquartile range:", iqr_length, "mm\n")
Interquartile range: 106.25 mm
# Calculate by lake
grayling_df %>%
group_by(lake) %>%
summarise(
q1 = quantile(total_length_mm, 0.25),
q3 = quantile(total_length_mm, 0.75),
iqr = IQR(total_length_mm)
) %>%
  kable(caption = "Quartiles and IQR by lake", digits = 1)

| lake | q1 | q3 | iqr |
|---|---|---|---|
| I3 | 256 | 280 | 24 |
| I8 | 340 | 401 | 61 |
Biological data are often skewed (asymmetrical), which can make the arithmetic mean less representative of central tendency. Data transformations can help address this issue.
The logarithmic transformation is one of the most common for right-skewed biological data:

\[Y' = \log(Y)\]
When data are log-normally distributed, the geometric mean often provides a better measure of central tendency than the arithmetic mean.
# Add log-transformed length to our dataset
grayling_df <- grayling_df %>%
mutate(log_mass = log(mass_g))
# Compare original and log-transformed distributions
p1 <- ggplot(grayling_df, aes(x = mass_g)) +
geom_histogram(bins = 15, fill = "steelblue", color = "white") +
labs(
title = "Original Mass Distribution",
x = "Total Mass (g)",
y = "Count"
)
p2 <- ggplot(grayling_df, aes(x = log_mass)) +
geom_histogram(bins = 15, fill = "steelblue", color = "white") +
labs(
title = "Log-Transformed Mass Distribution",
    x = "Log(Mass (g))",
y = "Count"
)
# Display side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)

# Compare means on original and transformed scales
mean_log_mass <- mean(grayling_df$log_mass)  # NB: mass_g has missing values, so this is NA
back_transformed_mean <- exp(mean_log_mass)  # add na.rm = TRUE above to get a usable value
cat("Arithmetic mean of original data:", round(mean_length, 1), "mm\n")
Arithmetic mean of original data: 265.6 mm
Geometric mean (back-transformed mean of logs): NA mm
Histograms
Histograms show the frequency distribution of our data.
# Create a histogram
ggplot(grayling_df, aes(x = mass_g)) +
geom_histogram(bins = 15, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Fish Mass",
x = "Total Mass (g)",
y = "Count"
)

# Histograms by lake
ggplot(grayling_df, aes(x = mass_g, fill = lake)) +
geom_histogram(bins = 15, position = "dodge", alpha = 0.7) +
labs(
title = "Distribution of Fish Mass by Lake",
x = "Total Mass (g)",
y = "Count"
)

Box Plots
Box plots show the median, quartiles, and potential outliers.
# Create a box plot
ggplot(grayling_df, aes(y = total_length_mm)) +
geom_boxplot(fill = "steelblue", alpha = 0.7) +
labs(
title = "Box Plot of Fish Lengths",
y = "Total Length (mm)"
)

# Box plot by lake
ggplot(grayling_df, aes(x = lake, y = total_length_mm, fill = lake)) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Box Plot of Fish Lengths by Lake",
x = "Lake",
y = "Total Length (mm)"
)

The mean and median measure different aspects of a distribution:
Mean: Center of gravity of the distribution
Median: Middle value of the data
When a distribution is symmetric, the mean and median are similar. When it’s skewed or has outliers, they can differ significantly.
# Calculate summary statistics by lake
stats_by_lake <- grayling_df %>%
group_by(lake) %>%
summarise(
mean = mean(total_length_mm),
median = median(total_length_mm),
sd = sd(total_length_mm),
iqr = IQR(total_length_mm),
skewness = moments::skewness(total_length_mm)
)
# Display the results
kable(stats_by_lake, caption = "Comparison of Mean and Median by Lake", digits = 1)

| lake | mean | median | sd | iqr | skewness |
|---|---|---|---|---|---|
| I3 | 265.6 | 266 | 28.3 | 24 | -0.9 |
| I8 | 362.6 | 373 | 52.3 | 61 | -1.1 |
# Create a density plot with vertical lines for mean and median
ggplot(grayling_df, aes(x = total_length_mm, fill = lake)) +
geom_density(alpha = 0.5) +
geom_vline(data = stats_by_lake,
aes(xintercept = mean, color = "Mean"),
linetype = "dashed", linewidth = 1) +
geom_vline(data = stats_by_lake,
aes(xintercept = median, color = "Median"),
linetype = "solid", linewidth = 1) +
scale_color_manual(values = c("Mean" = "red", "Median" = "blue")) +
facet_wrap(~ lake, ncol = 1) +
labs(
title = "Density of Fish Lengths",
x = "Total Length (mm)",
y = "Density",
color = "Statistic"
)

The standard deviation and interquartile range both measure spread, but:
Standard deviation: Sensitive to outliers
Interquartile range: Robust against outliers
When the data are approximately normal, the IQR ≈ 1.35 × standard deviation.
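The 1.35 constant is not arbitrary: it is the IQR of a standard normal distribution (which has a standard deviation of 1), and we can compute it directly from the normal quantile function:

```r
# IQR of a standard normal: the source of the 1.35 rule of thumb
qnorm(0.75) - qnorm(0.25)   # 1.34898
```

For a normal variable with standard deviation s, both quartiles scale by s, so IQR ≈ 1.35s regardless of the mean.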
# Calculate the ratio of IQR to SD for our data
grayling_df %>%
group_by(lake) %>%
summarise(
sd = sd(total_length_mm),
iqr = IQR(total_length_mm),
ratio_iqr_sd = IQR(total_length_mm) / sd(total_length_mm)
) %>%
  kable(caption = "Comparison of SD and IQR by Lake", digits = 2)

| lake | sd | iqr | ratio_iqr_sd |
|---|---|---|---|
| I3 | 28.30 | 24 | 0.85 |
| I8 | 52.34 | 61 | 1.17 |
Percentiles are values that divide a dataset into 100 equal parts.
The 25th percentile is the first quartile (Q1)
The 50th percentile is the median
The 75th percentile is the third quartile (Q3)
The IQR is the difference between Q3 and Q1.
# Calculate percentiles
percentiles <- quantile(grayling_df$total_length_mm,
probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
# Display the percentiles
kable(
data.frame(
Percentile = c("10th", "25th (Q1)", "50th (Median)", "75th (Q3)", "90th"),
Value = percentiles
),
caption = "Key Percentiles of Fish Length (mm)",
digits = 1
)

| Percentile | Value |
|---|---|
| 10th | 251.1 |
| 25th (Q1) | 270.8 |
| 50th (Median) | 324.5 |
| 75th (Q3) | 377.0 |
| 90th | 408.6 |
Let’s examine how missing values affect our descriptive statistics by looking at the mass variable, which has some missing data.
# Count the missing values in the mass variable
sum(is.na(grayling_df$mass_g))
[1] 2
# Calculate descriptive statistics with and without handling missing values
# Without handling (will produce NA results)
cat("Mean mass without handling NAs:", mean(grayling_df$mass_g), "g\n")
Mean mass without handling NAs: NA g
# With handling missing values
cat("Mean mass with na.rm=TRUE:", mean(grayling_df$mass_g, na.rm = TRUE), "g\n")
Mean mass with na.rm=TRUE: 351.2289 g
# Calculate descriptive statistics by lake, properly handling NAs
grayling_df %>%
group_by(lake) %>%
summarise(
mean_mass = mean(mass_g, na.rm = TRUE),
median_mass = median(mass_g, na.rm = TRUE),
sd_mass = sd(mass_g, na.rm = TRUE),
n_missing = sum(is.na(mass_g))
) %>%
  kable(caption = "Mass Statistics by Lake (Handling Missing Values)", digits = 1)

| lake | mean_mass | median_mass | sd_mass | n_missing |
|---|---|---|---|---|
| I3 | 150.5 | 147 | 42.2 | 0 |
| I8 | 483.7 | 490 | 176.5 | 2 |
Always check for missing values in your data before calculating statistics.
Use na.rm = TRUE when calculating summary statistics to handle missing values.
Report the number of missing values along with your statistics.
Consider whether the missing values are random or might introduce bias.
Now that we have estimates from the sample, we need to relate them to the population.
In reality, we rarely know the true population parameters. When studying fish in lakes I3 and I8:
Let’s demonstrate how different samples from the same population can give different estimates.
If we could sample all fish in the lake, we would know the true mean length. But that’s usually impossible in ecology!
Let’s take several random samples from Lake I3 and see how the sample means vary:
# Filter for Lake I3
i3_data <- grayling_df %>% filter(lake == "I3")
# Function to take a random sample and calculate the mean
sample_mean <- function(data, sample_size) {
sample_data <- sample_n(data, sample_size)
return(mean(sample_data$total_length_mm))
}
# Take 10 different samples of size 15 from Lake I3
set.seed(123) # For reproducibility
sample_size <- 15
sample_means <- replicate(10, sample_mean(i3_data, sample_size))
# Create a data frame with sample numbers and means
samples_df <- data.frame(
sample_number = 1:10,
sample_mean = sample_means
)
# Display the sample means
samples_df

   sample_number sample_mean
1 1 269.9333
2 2 260.6000
3 3 255.2000
4 4 263.4000
5 5 275.3333
6 6 279.2667
7 7 263.7333
8 8 273.6000
9 9 264.8000
10 10 269.8667
# Mean and standard deviation of the 10 sample means
mean(sample_means)
[1] 267.5733
sd(sample_means)
[1] 7.346063
# Plot the different sample means
ggplot(samples_df, aes(x = factor(sample_number), y = sample_mean)) +
geom_point(size = 3, color = "blue") +
geom_hline(yintercept = mean(i3_data$total_length_mm),
linetype = "dashed", color = "red") +
annotate("text", x = 5, y = mean(i3_data$total_length_mm) + 2,
label = "Overall sample mean", color = "red") +
labs(title = "Means of 10 Random Samples from Lake I3",
x = "Sample Number",
y = "Sample Mean (mm)") +
  theme_minimal()

Notice how each sample’s mean differs from the overall mean. This demonstrates sampling variation.
The standard error of the mean (SEM) measures the precision of a sample mean as an estimate of the population mean.
Formula: \(SE_{\bar{x}} = \frac{s}{\sqrt{n}}\)
Where:
- s is the sample standard deviation
- n is the sample size

The standard error tells us:
- How much uncertainty is in our estimate
- How much sample means are expected to vary
- How close our sample mean is likely to be to the true population mean

Remember:
- Standard deviation (s) describes the variability in the individual data points
- Standard error (SE) describes the variability in the sample mean itself
- As sample size increases, SE decreases (more precise estimate)
Let’s calculate and visualize the standard error for both lakes:
# Calculate mean, SD, and SE for each lake
grayling_stats <- grayling_df %>%
group_by(lake) %>%
summarize(
mean_length = mean(total_length_mm),
sd_length = sd(total_length_mm),
n = n(),
se_length = sd_length / sqrt(n)
)
# Display the statistics
grayling_stats

# A tibble: 2 × 5
lake mean_length sd_length n se_length
<chr> <dbl> <dbl> <int> <dbl>
1 I3 266. 28.3 66 3.48
2 I8 363. 52.3 102 5.18
# Create a bar plot with error bars representing ±1 SE
ggplot(grayling_stats, aes(x = lake, y = mean_length, fill = lake)) +
geom_bar(stat = "identity", alpha = 0.7) +
geom_errorbar(aes(ymin = mean_length - se_length,
ymax = mean_length + se_length),
width = 0.2) +
labs(title = "Mean Fish Length by Lake with Standard Error",
subtitle = "Error bars represent ±1 standard error",
x = "Lake",
y = "Mean Length (mm)") +
  theme_minimal()

The sampling distribution of the mean is the theoretical distribution of all possible sample means of a given sample size from a population.
Important properties:
1. It is centered at the population mean (μ)
2. Its standard deviation is the standard error (σ/√n)
3. For large sample sizes, it approaches a normal distribution (Central Limit Theorem)

The larger the sample size:
- The narrower the sampling distribution
- The smaller the standard error
- The more precise our estimate of the population mean
Let’s simulate the sampling distribution for Lake I3 fish data.
Let’s simulate taking many samples from Lake I3 to visualize the sampling distribution:
# Filter for Lake I3
i3_data <- grayling_df %>% filter(lake == "I3")
# Number of samples to simulate
num_simulations <- 1000
sample_size <- 20
# Simulate many samples and calculate means
set.seed(456) # For reproducibility
simulated_means <- replicate(num_simulations, sample_mean(i3_data, sample_size))
# Calculate the mean and standard deviation of the simulated means
mean_of_means <- mean(simulated_means)
sd_of_means <- sd(simulated_means)
# Create a data frame with the simulated means
simulated_df <- data.frame(sample_mean = simulated_means)
# Plot the sampling distribution
ggplot(simulated_df, aes(x = sample_mean)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
geom_vline(xintercept = mean(i3_data$total_length_mm),
             linetype = "dashed", color = "red", linewidth = 1) +
annotate("text", x = mean(i3_data$total_length_mm) + 2, y = 50,
label = "Full sample mean", color = "red") +
labs(title = "Simulated Sampling Distribution of the Mean",
subtitle = paste("Based on", num_simulations, "samples of size", sample_size),
x = "Sample Mean (mm)",
y = "Frequency") +
  theme_minimal()
Notice that the simulated sampling distribution:
Is approximately normally distributed
Is centered around the overall sample mean
Has a spread that is related to the standard error
Let’s see how the standard error changes with different sample sizes:
# Define a range of sample sizes to test
sample_sizes <- c(5, 10, 20, 30, 50)
# For each sample size, simulate the sampling distribution and calculate SE
results <- data.frame()
for (size in sample_sizes) {
# Simulate many sample means for this sample size
simulated_means <- replicate(500, sample_mean(i3_data, size))
# Calculate the standard deviation of the sampling distribution (empirical SE)
empirical_se <- sd(simulated_means)
# Calculate the theoretical SE
theoretical_se <- sd(i3_data$total_length_mm) / sqrt(size)
# Add to results
results <- rbind(results, data.frame(
sample_size = size,
empirical_se = empirical_se,
theoretical_se = theoretical_se
))
}
# Display the results
results

  sample_size empirical_se theoretical_se
1 5 12.349407 12.657835
2 10 8.178270 8.950441
3 20 5.558957 6.328918
4 30 3.792177 5.167540
5 50 2.099744 4.002759
# Plot how SE changes with sample size
results_long <- pivot_longer(results,
cols = c(empirical_se, theoretical_se),
names_to = "se_type",
values_to = "standard_error")
ggplot(results_long, aes(x = sample_size, y = standard_error, color = se_type)) +
geom_line() +
geom_point(size = 3) +
scale_x_continuous(breaks = sample_sizes) +
labs(title = "Standard Error vs. Sample Size",
subtitle = "Standard error decreases as sample size increases",
x = "Sample Size",
y = "Standard Error",
color = "SE Type") +
  theme_minimal()

A confidence interval is a range of values that is likely to contain the true population parameter.
The 95% confidence interval for the mean is approximately:
\(\bar{x} \pm 2 \times SE_{\bar{x}}\)
This “2 SE rule of thumb” means:
- The interval extends 2 standard errors below and above the sample mean
- About 95% of such intervals constructed from different samples would contain the true population mean
Confidence intervals provide a way to express the precision of our estimates.
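The "about 95%" figure comes straight from the normal distribution; we can check how much probability lies within ±2 standard errors of the mean:

```r
# Probability that a normal variable lies within 2 SD (or SE) of its mean
pnorm(2) - pnorm(-2)   # 0.9544997 -- the "about 95%" rule
```

The exact multiplier for a 95% interval is 1.96 (which is why the later code uses 1.96); 2 is simply a convenient round number for mental arithmetic.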
Let’s calculate and visualize the 95% confidence intervals for the mean fish length in each lake:
# Calculate 95% confidence intervals
grayling_ci <- grayling_df %>%
group_by(lake) %>%
summarize(
mean_length = mean(total_length_mm),
sd_length = sd(total_length_mm),
n = n(),
se_length = sd_length / sqrt(n),
ci_lower = mean_length - 2 * se_length,
ci_upper = mean_length + 2 * se_length
)
# Display the confidence intervals
grayling_ci

# A tibble: 2 × 7
lake mean_length sd_length n se_length ci_lower ci_upper
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 I3 266. 28.3 66 3.48 259. 273.
2 I8 363. 52.3 102 5.18 352. 373.
# Plot with confidence intervals
ggplot(grayling_ci, aes(x = lake, y = mean_length, fill = lake)) +
geom_bar(stat = "identity", alpha = 0.7) +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
width = 0.2) +
labs(title = "Mean Fish Length by Lake with 95% Confidence Intervals",
subtitle = "Error bars represent 95% confidence intervals",
x = "Lake",
y = "Mean Length (mm)") +
  theme_minimal()

Let’s compare different ways of displaying uncertainty in our estimates:
# Calculate statistics for different types of error bars
grayling_error_bars <- grayling_df %>%
group_by(lake) %>%
summarize(
mean_length = mean(total_length_mm),
sd_length = sd(total_length_mm),
n = n(),
se_length = sd_length / sqrt(n),
ci_lower = mean_length - 1.96 * se_length,
ci_upper = mean_length + 1.96 * se_length,
one_sd_lower = mean_length - sd_length,
one_sd_upper = mean_length + sd_length
)
# Create a data frame for plotting different error types
lake_i3 <- grayling_error_bars %>% filter(lake == "I3")
error_types <- data.frame(
error_type = c("Standard Deviation", "Standard Error", "95% Confidence Interval"),
lower = c(lake_i3$one_sd_lower,
lake_i3$mean_length - lake_i3$se_length,
lake_i3$ci_lower),
upper = c(lake_i3$one_sd_upper,
lake_i3$mean_length + lake_i3$se_length,
lake_i3$ci_upper)
)
# Plot the comparison
ggplot() +
geom_point(data = lake_i3, aes(x = "Mean", y = mean_length), size = 4) +
geom_errorbar(data = error_types,
aes(x = error_type, ymin = lower, ymax = upper, color = error_type),
                width = 0.2, linewidth = 1) +
labs(title = "Different Types of Error Bars for Lake I3",
subtitle = "Comparing standard deviation, standard error, and 95% confidence interval",
x = "",
y = "Length (mm)",
color = "Error Bar Type") +
theme_minimal() +
  theme(legend.position = "none")

In this lecture, we’ve explored:
These tools form the foundation of statistical analysis and will be essential as we move forward to more complex statistical methods.